The SQL code tutorials for Db2 rely on a Jupyter notebook extension, commonly refer to as a "magic" command. The beginning of all of the notebooks begin with the following command which will load the extension and allow the remainder of the notebook to use the %sql magic command.
%run db2.ipynbThe cell below will load the Db2 extension. Note that it will take a few seconds for the extension to load, so you should generally wait until the "Db2 Extensions Loaded" message is displayed in your notebook. In the event you get an error on the load of the ibm_db library, modify the command to include the -update option:
run db2.ipynb -update
In [ ]:
%run db2.ipynb
Before any SQL commands can be issued, a connection needs to be made to the Db2 database that you will be using. The connection can be done manually (through the use of the CONNECT command), or automatically when the first %sql
command is issued.
The Db2 magic command tracks whether or not a connection has occured in the past and saves this information between notebooks and sessions. When you start up a notebook and issue a command, the program will reconnect to the database using your credentials from the last session. In the event that you have not connected before, the system will prompt you for all the information it needs to connect. This information includes:
There will be default values presented in the panels that you can accept, or enter your own values. All of the information will be stored in the directory that the notebooks are stored on. Once you have entered the information, the system will attempt to connect to the database for you and then you can run all of the SQL scripts. More details on the CONNECT syntax will be found in a section below.
The next statement will force a CONNECT to occur with the default values. If you have not connected before, it will prompt you for the information.
In [ ]:
%sql CONNECT
The Db2 extension is made up of one magic command that works either at the LINE level (%sql
) or at the CELL level (%%sql
). If you only want to execute a SQL command on one line in your script, use the %sql form of the command. If you want to run a larger block of SQL, then use the %%sql
form. Note that when you use the %%sql
form of the command, the entire contents of the cell is considered part of the command, so you cannot mix other commands in the cell.
The following is an example of a line command:
In [ ]:
%sql VALUES 'HELLO THERE'
If you have SQL that requires multiple lines, of if you need to execute many lines of SQL, then you should
be using the CELL version of the %sql
command. To start a block of SQL, start the cell with %%sql
and do not place any SQL following the command. Subsequent lines can contain SQL code, with each SQL statement delimited with the semicolon (;
). You can change the delimiter if required for procedures, etc... More details on this later.
In [ ]:
%%sql
VALUES
1,
2,
3
If you are using a single statement then there is no need to use a delimiter. However, if you are combining a number of commands then you must use the semicolon.
In [ ]:
%%sql
DROP TABLE STUFF;
CREATE TABLE STUFF (A INT);
INSERT INTO STUFF VALUES
1,2,3;
SELECT * FROM STUFF;
The script will generate messages and output as it executes. Each SQL statement that generates results will have a table displayed with the result set. If a command is executed, the results of the execution get listed as well. The script you just ran probably generated an error on the DROP table command.
Both forms of the %sql
command have options that can be used to change the behavior of the code. For both forms of the command (%sql
, %%sql
), the options must be on the same line as the command:
%sql -t ... %%sql -t
The only difference is that the %sql
command can have SQL following the parameters, while the %%sql
requires the SQL to be placed on subsequent lines.
There are a number of parameters that you can specify as part of the %sql
statement.
Multiple parameters are allowed on a command line. Each option should be separated by a space:
%sql -a -j ...
A SELECT statement will return the results as a dataframe and display the results as a table in the notebook. If you use the assignment statement, the dataframe will be placed into the variable and the results will not be displayed:
r = %sql SELECT * FROM EMPLOYEE
The sections below will explain the options in more detail.
The default delimiter for all SQL statements is the semicolon. However, this becomes a problem when you try to create a trigger, function, or procedure that uses SQLPL (or PL/SQL). Use the -d option to turn the SQL delimiter into the at (@
) sign and -q to suppress error messages. The semi-colon is then ignored as a delimiter.
For example, the following SQL will use the @
sign as the delimiter.
In [ ]:
%%sql -d -q
DROP TABLE STUFF
@
CREATE TABLE STUFF (A INT)
@
INSERT INTO STUFF VALUES
1,2,3
@
SELECT * FROM STUFF
@
The delimiter change will only take place for the statements following the %%sql
command. Subsequent cells
in the notebook will still use the semicolon. You must use the -d option for every cell that needs to use the
semicolon in the script.
The default number of rows displayed for any result set is 10. You have the option of changing this option when initially connecting to the database. If you want to override the number of rows display you can either update the control variable, or use the -a option. The -a option will display all of the rows in the answer set. For instance, the following SQL will only show 10 rows even though we inserted 15 values:
In [ ]:
%sql values 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
You will notice that the displayed result will split the visible rows to the first 5 rows and the last 5 rows. Using the -a option will display all values:
In [ ]:
%sql -a values 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
To change the default value of rows displayed, you can either do a CONNECT RESET (discussed later) or set the Db2 control variable maxrows to a different value. A value of -1 will display all rows.
In [ ]:
# Save previous version of maximum rows
last_max = _settings['maxrows']
_settings['maxrows'] = 5
%sql values 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
A special note regarding the output from a SELECT statement. If the SQL statement is the last line of a block, the results will be displayed by default (unless you assigned the results to a variable). If the SQL is in the middle of a block of statements, the results will not be displayed. To explicitly display the results you must use the display function (or pDisplay if you have imported another library like pixiedust which overrides the pandas display function).
In [ ]:
# Set the maximum back
_settings['maxrows'] = last_max
%sql values 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
In [ ]:
%%sql
DROP TABLE TABLE_NOT_FOUND;
DROP TABLE TABLE_SPELLED_WRONG;
If you know that these errors may occur you can silence them with the -q option.
In [ ]:
%%sql -q
DROP TABLE TABLE_NOT_FOUND;
DROP TABLE TABLE_SPELLED_WRONG;
SQL output will not be suppressed, so the following command will still show the results.
In [ ]:
%%sql -q
DROP TABLE TABLE_NOT_FOUND;
DROP TABLE TABLE_SPELLED_WRONG;
VALUES 1,2,3;
The %sql
syntax allows you to pass local variables to a script. There are 5 predefined variables defined in the program:
These variables are all part of a structure called _settings. To pass a value to a LINE script, use the braces {} to surround the name of the variable:
{_settings["database"]}
The next line will display the currently connected database.
In [ ]:
%sql VALUES '{_settings["database"]}'
You cannot use variable substitution with the CELL version of the %%sql
command. If your SQL statement extends beyond one line, and you want to use variable substitution, you can use a couple of techniques to make it look like one line. The simplest way is to add the backslash character (\
) at the end of every line. The following example illustrates the technique.
In [ ]:
empno = '000010'
%sql SELECT LASTNAME FROM \
EMPLOYEE \
WHERE \
EMPNO = '{empno}'
The other option for passing variables to a %sql
or %%sql
statement is to use the embedded variable format. This requires that the variable be prefixed with a colon (:
) in front of it. When using this format, you do not need to use quote characters around the variables since its value is extracted at run time. The first example uses the value of the variable.
In [ ]:
empno = '000010'
%sql select lastname from employee where empno='{empno}'
This example uses the embedded variable name (:empno
).
In [ ]:
%sql select lastname from employee where empno=:empno
Sometimes you want to see how the execution of a statement changes with the addition of indexes or other optimization changes. The -t option will run the statement on the LINE or one SQL statement in the CELL for exactly one second. The results will be displayed and optionally placed into a variable. The syntax of the command is:
sql_time = %sql -t SELECT * FROM EMPLOYEEFor instance, the following SQL will time the VALUES clause.
In [ ]:
%sql -t VALUES 1,2,3,4,5,6,7,8,9
When timing a statement, no output will be displayed. If your SQL statement takes longer than one second you will need to modify the db2 _runtime variable. This variable must be set to the number of seconds that you want to run the statement.
In [ ]:
_runtime = 5
%sql -t VALUES 1,2,3,4,5,6,7,8,9
In [ ]:
%%sql
VALUES
'{
"empno":"000010",
"firstnme":"CHRISTINE",
"midinit":"I",
"lastname":"HAAS",
"workdept":"A00",
"phoneno":[3978],
"hiredate":"01/01/1995",
"job":"PRES",
"edlevel":18,
"sex":"F",
"birthdate":"08/24/1963",
"pay" : {
"salary":152750.00,
"bonus":1000.00,
"comm":4220.00}
}'
Adding the -j option to the %sql (or %%sql) command will format the first column of a return set to better display the structure of the document. Note that if your answer set has additional columns associated with it, they will not be displayed in this format.
In [ ]:
%%sql -j
VALUES
'{
"empno":"000010",
"firstnme":"CHRISTINE",
"midinit":"I",
"lastname":"HAAS",
"workdept":"A00",
"phoneno":[3978],
"hiredate":"01/01/1995",
"job":"PRES",
"edlevel":18,
"sex":"F",
"birthdate":"08/24/1963",
"pay" : {
"salary":152750.00,
"bonus":1000.00,
"comm":4220.00}
}'
Sometimes it would be useful to display a result set as either a bar, pie, or line chart. The first one or two columns of a result set need to contain the values need to plot the information.
The three possible plot options are:
The following data will be used to demonstrate the different charting options.
In [ ]:
%sql values 1,2,3,4,5
Since the results only have one column, the pie, line, and bar charts will not have any labels associated with them. The first example is a bar chart.
In [ ]:
%sql -pb values 1,2,3,4,5
The same data as a pie chart.
In [ ]:
%sql -pp values 1,2,3,4,5
And finally a line chart.
In [ ]:
%sql -pl values 1,2,3,4,5
If you retrieve two columns of information, the first column is used for the labels (X axis or pie slices) and the second column contains the data.
In [ ]:
%sql -pb values ('A',1),('B',2),('C',3),('D',4),('E',5)
For a pie chart, the first column is used to label the slices, while the data comes from the second column.
In [ ]:
%sql -pp values ('A',1),('B',2),('C',3),('D',4),('E',5)
Finally, for a line chart, the x contains the labels and the y values are used.
In [ ]:
%sql -pl values ('A',1),('B',2),('C',3),('D',4),('E',5)
The following SQL will plot the number of employees per department.
In [ ]:
%%sql -pb
SELECT WORKDEPT, COUNT(*)
FROM EMPLOYEE
GROUP BY WORKDEPT
The final option for plotting data is to use interactive mode -i
. This will display the data using an open-source project called Pixiedust. You can view the results in a table and then interactively create a plot by dragging and dropping column names into the appropriate slot. The next command will place you into interactive mode.
In [ ]:
%sql -i select * from employee
Many of the Db2 notebooks depend on two of the tables that are found in the SAMPLE database. Rather than having to create the entire SAMPLE database, this option will create and populate the EMPLOYEE and DEPARTMENT tables in your database. Note that if you already have these tables defined, they will not be dropped.
In [ ]:
%sql -sampledata
In [ ]:
%sql select * from employee fetch first 3 rows only
You can assign the result set directly to a variable.
In [ ]:
x = %sql select * from employee fetch first 3 rows only
The variable x contains the dataframe that was produced by the %sql
statement so you access the result set by using this variable or display the contents by just referring to it in a command line.
In [ ]:
x
There is an additional way of capturing the data through the use of the -r
flag.
var = %sql -r select * from employeeRather than returning a dataframe result set, this option will produce a list of rows. Each row is a list itself. The rows and columns all start at zero (0), so to access the first column of the first row, you would use var[0][0] to access it.
In [ ]:
rows = %sql -r select * from employee fetch first 3 rows only
print(rows[0][0])
The number of rows in the result set can be determined by using the length function.
In [ ]:
print(len(rows))
If you want to iterate over all of the rows and columns, you could use the following Python syntax instead of creating a for loop that goes from 0 to 41.
In [ ]:
for row in rows:
line = ""
for col in row:
line = line + str(col) + ","
print(line)
Since the data may be returned in different formats (like integers), you should use the str() function to convert the values to strings. Otherwise, the concatenation function used in the above example will fail. For instance, the 6th field is a birthdate field. If you retrieve it as an individual value and try and concatenate a string to it, you get the following error.
In [ ]:
print("Birth Date="+rows[0][6])
You can fix this problem by adding the str function to convert the date.
In [ ]:
print("Birth Date="+str(rows[0][6]))
As mentioned at the beginning of this notebook, connecting to Db2 is automatically done when you issue your first
%sql
statement. Usually the program will prompt you with what options you want when connecting to a database. The other option is to use the CONNECT statement directly. The CONNECT statement is similar to the native Db2
CONNECT command, but includes some options that allow you to connect to databases that has not been
catalogued locally.
The CONNECT command has the following format:
%sql CONNECT TO <database> USER <userid> USING <password | ?> HOST <ip address> PORT <port number>If you use a "?" for the password field, the system will prompt you for a password. This avoids typing the password as clear text on the screen. If a connection is not successful, the system will print the error message associated with the connect request.
If the connection is successful, the parameters are saved on your system and will be used the next time you run a SQL statement, or when you issue the %sql CONNECT command with no parameters.
If you want to force the program to connect to a different database (with prompting), use the CONNECT RESET command. The next time you run a SQL statement, the program will prompt you for the the connection and will force the program to reconnect the next time a SQL statement is executed.
In [ ]:
%sql CONNECT RESET
In [ ]:
%sql CONNECT